DNA was extracted from 197 isolates of Ascochyta rabiei (collected in 2020) and sent for Whole-Genome-Sequencing (WGS) at the Australian Genome Research Facility (AGRF, Melbourne) on 1 lane of a NovaSeq flowcell, producing 150 bp paired-end reads (run name CAGRF20114478).
Details of the sequenced isolates is provided in (Table 1).
WGS data processing, mapping and variant calling were performed on the QCIF Awoonga and Griffith Gowonda HPC Clusters (using PBSPro scheduler), using Snippy v4.6.0, a rapid haploid variant calling and core genome alignment. The pipeline uses FreeBayes v1.3.5 (Garrison and Marth 2012) and other tools to assign variant probability scores and call variants.
Consider the following options for file download and processing:
$TMPDIR (limited to 20GB), then cleanup (using --cleanup in Snippy) and copy just the important results (fastq.genozip files, QC reports and Snippy results) to scratch folder and into CloudStor (with rclone)scratch folder , then cleanup and upload into CloudStor.At the current pipeline, all processing is done at the the read files were downloaded and then following a standard naming conventions (to start from a pair of SampleID_R#.fastq.gz file), to make it much easier to process all files with the same script, using parameters to specify batch name, read length, sequencing platform and other potential variables.
Detailed methods, including code for running each of the analyses steps are provided in the associated A_rabiei_WGS_analysis GitHub repository.
Install needed software in a conda environment on the new Gowonda HPC cluster.
# download conda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# install conda
bash Miniconda3-latest-Linux-x86_64.sh
# initialise conda
source ~/.bashrc
# add channels and set priorities
conda config --add channels conda-forge
conda config --append channels bioconda
# install extra packages to the base environment
conda install -n base libgcc gnutls libuuid readline cmake git tmux libgfortran parallel mamba gawk pigz rename genozip autoconf sshpass
# install snippy (need to fix internet connection to gowonda2 - use patched netcheck in ~/bin)
# source ~/.proxy
CONDA_NAME=snippy
conda create -n $CONDA_NAME snippy sra-tools bcbio-gff libgd xorg-libxpm bdeops \
libpng libjpeg-turbo jpeg snpsift rename biobambam bwa-mem2 sambamba \
libtiff genozip parallel qualimap multiqc bbmap fastp genozip
# Clean extra space
# conda update -n base conda
conda clean -y --all
# cpanm git://github.com/IdoBar/XML-DOM-XPath-0.14@patch1
# cpanm --force Bio::SeqIO
# Install pdfx to parse the report and download the files, see https://stackoverflow.com/a/33173484
pip install pdfxREF_DIR=$HOME/data/reference_genomes/Ascochyta_reference_genomes # on Griffith HPC
# REF_DIR=$HOME/data/reference_genomes # on Awoonga HPC
WORK_DIR=$HOME/scratch/A_rabiei_WGS_2021 # on Griffith HPC
# WORK_DIR=$HOME/90days/data/A_rabiei_WGS_2021 # on Awoonga# download reference genomes
source ~/.proxy # on Griffith HPC
wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/011/695/GCF_004011695.1_Arabiei_Me14/GCF_004011695.1_Arabiei_Me14_genomic.fna.gz -O $REF_DIR/GCF_004011695.1_Arabiei_Me14_genomic.fna.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/011/695/GCF_004011695.1_Arabiei_Me14/GCF_004011695.1_Arabiei_Me14_genomic.gff.gz -O $REF_DIR/GCF_004011695.1_Arabiei_Me14_genomic.gff.gz
# convert gff3 (from AUGUSTUS) and fasta files to genbank format
GENOME="$REF_DIR/GCF_004011695.1_Arabiei_Me14_genomic"
ls -1 $GENOME*.gz | parallel pigz -d {}
ln -s $GENOME.fna $GENOME.fa
genozip --make-reference $GENOME.fa
# cd $HOME/scratch/data/reference_genomes/Ascochyta_reference_genomes/
# download conversion script
# wget https://raw.githubusercontent.com/chapmanb/bcbb/master/gff/Scripts/gff/gff_to_genbank.py
# python gff_to_genbank.py $GENOME.gff3 $GENOME.fastaDownload the list of raw .fastq.gz files from AGRF and upload to CloudStor.
mkdir -p $WORK_DIR
cd !$
SSHUSER=NiloofarVaghefi1
SSHPASS=Ashtad3721
AGRF_DIR=files/AGRF_CAGRF20114478_H3HGFDSX2
# get the list of files from AGRF - doesn't work on Griffith HPC because of proxy settings - try using corkscrew (http://wiki.kartbuilding.net/index.php/Corkscrew_-_ssh_over_https)
sshpass -p $SSHPASS ssh $SSHUSER@agrf-data.agrf.org.au "ls -1 $AGRF_DIR" > AGRF_CAGRF20114478_H3HGFDSX2_files.list
# download md5 file
sshpass -p $SSHPASS rsync -avzh $SSHUSER@agrf-data.agrf.org.au:$AGRF_DIR/checksums.md5 ./
# rsync NiloofarVaghefi1@agrf-data.agrf.org.au:files/AGRF_CAGRF20114478_H3HGFDSX2/'*'
#rsync -avh -e ssh IdoBar1@agrf-data.agrf.org.au:files/AGRF_CAGRF19461_HGMLMAFXY ~/scratch/data/A_rabiei_WGS/
# download 2019 sequencing
# AGRF
rclone copy -P --max-depth 1 CloudStor:Shared/GRIFFITH-Ford-Crop-Genetics-Lab/A_rabiei_WGS/AGRF_CAGRF19461_HGMLMAFXY/AGRF_BT2_process_04_04_2019 AGRF_CAGRF19461 --include AGRF*.fastq.gz
# AgVic
rclone copy -P --max-depth 1 CloudStor:Shared/GRIFFITH-Ford-Crop-Genetics-Lab/A_rabiei_WGS/AgVic_WGS2018 AgVic_WGS20
18
# Macrogen
rclone copy -P --max-depth 1 CloudStor:Shared/GRIFFITH-Ford-Crop-Genetics-Lab/A_rabiei_WGS/Macrogen_sequences_1702KHP-0
164 Macrogen_sequences_1702KHP-0164Adaptors needed to be removed, as well as very low quality bases/reads, so trimming was performed with fastp v0.20.1, which also produced QC figures and reports.
Raw files quality after adaptor trimming was assessed with FastQC v0.11.8 (Andrews 2014; Chen et al. 2018).
The trimmed reads were mapped to the A. rabiei reference genome, strain Me14 (GCA_004011695.1) using Snippy v4.6.0. Snippy is a wrapper that makes use of popular bioinformatics tools, such as bwa-mem v0.7.17-r1188 (Li 2013) to map the reads to the reference genome, followed by several commands of samtools v1.12 (Li et al. 2009) to specify a Read Group for each sample (provided to Snippy with the --rgid flag), mark duplicates and convert the alignments into coordinate-sorted, indexed BAM files.
The alignment files are then processed by Freebayes v1.3.5 (Garrison and Marth 2012) to call variants from all samples and bcftools v1.12 and snpEff v5.0 (Cingolani et al. 2012) to filter and annotate the variants and retain only high-quality variants (based on minimum depth and genotype quality thresholds) that are common to all samples and referred to as core SNPs.
Mapping statistics were obtained with Qualimap v.2.2.2-dev (Okonechnikov et al. 2016) and consolidated along with pre and post-trimming QC measures into a single, interactive report for each batch using MultiQC v1.10.1 (Ewels et al. 2016) (see MultiQC report.
Processing of the files was done on QCIF Awoonga/Griffith Gowonda HPC clusters
# work on Awoonga (because Griffith HPC can't download the fastq files directly...)
CONDA_NAME=/export/home/s2978925/miniconda3/envs/snippy
conda activate $CONDA_NAME
BBMAP_REF=$(find $CONDA_PREFIX -wholename "*resources/adapters.fa")
# Prepare the BBduk commands
DATE=`date +%d_%m_%Y`
BATCH=AGRF
RUN="${BATCH}_snippy_${DATE}" # day of run was 02_02_2019
RUN_DIR=$WORK_DIR/${RUN}
mkdir -p $RUN_DIR
cd $RUN_DIR
# GENOME="$HOME/scratch/data/reference_genomes/Ascochyta_reference_genomes/A.rabiei_me14"
# READ_LENGTH=150
# PLOIDY=1
NCORES=12
RAM=28
# LANES=$( ls -1 ../*.fastq.gz | egrep -o "_L[0-9]+?_" | sort | uniq | wc -l )
RGPM=NovaSeq
RGPL=ILLUMINA
RGPU=H3HGFDSX2
RGCN=AGRF
COMMON_ID="DO16094321_HFKYMALXX"
case $RGPM in
NextSeq )
CLUMP_PARAM="dupedist=40 spany" ;;
HiSeq2500 )
CLUMP_PARAM="dupedist=40" ;;
HiSeq3000 )
CLUMP_PARAM="dupedist=2500" ;;
HiSeq4000 )
CLUMP_PARAM="dupedist=2500" ;;
NovaSeq )
CLUMP_PARAM="dupedist=12000" ;;
* )
CLUMP_PARAM="" ;;
esac
# where to store read files
# FQ_DIR="$RUN_DIR/fastq_files"
mkdir -p $RUN_DIR/CloudStor_copy
# Prepare a general array PBS script
echo '#!/bin/bash
#PBS -V
cd $WORKDIR
>&2 echo Current working directory: $PWD
source ~/.bashrc'"
conda activate $CONDA_PREFIX
set -Eeo pipefail
gawk -v ARRAY_IND=\$PBS_ARRAY_INDEX 'NR==ARRAY_IND' \$CMDS_FILE | bash" > ${WORK_DIR}/array.pbspro
# Then perform adapter trimming, mapping and variant calling (snippy); decide which intermediate files can be removed
# grep "_R1.fastq.gz" $WORK_DIR/AGRF_CAGRF20114478_H3HGFDSX2_files.list | sort | parallel -k --dry-run --rpl "{infile} s:_R1:_R#:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::" "rclone copy CloudStor:Shared/GRIFFITH-Ford-Crop-Genetics-Lab/A_rabiei_WGS/AGRF_CAGRF20114478_H3HGFDSX2/AGRF_CAGRF20114478/ --include \"{files}\" ./{sample} ; if [[ \$( md5sum {sample}/{files} | cut -d' ' -f1 ) == \$( egrep '{sample}_' $WORK_DIR/checksums.md5 | cut -d' ' -f1 ) ]]; then echo 'Files passed md5 check'; else exit 1 ; fi ; java -ea -XX:ParallelGCThreads=2 -Xmx4g -Xms4g -cp /home/ibar/.pyenv/versions/miniconda-latest/envs/snippy/opt/bbmap-38.90-1/current/ jgi.BBDuk ref=$BBMAP_REF ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=10 tpe tbo minlen=30 ziplevel=8 threads=\$[NCPUS - 2] ow in={sample}/{infile} out={sample}/{sample}_R#.trimmed.fastq.gz stats={sample}.stats ow ; mkdir -p trimmed_reads/{sample}_QC ; fastqc -t \$[NCPUS - 2] -o trimmed_reads/{sample}_QC {sample}/{sample}*.trimmed.fastq.gz ; snippy --force --cleanup --cpus \$[NCPUS - 2] --ram $[RAM-2] --outdir snippy_{sample} --report --rgid {sample} --ref $GENOME.fna --R1 {sample}/{sample}_R1.trimmed.fastq.gz --R2 {sample}/{sample}_R2.trimmed.fastq.gz; rm -r {sample}/ " > $RUN_DIR/$RUN.cmds
# Then perform adapter trimming (fastp), mapping and variant calling (snippy); decide which intermediate files can be removed
# grep "_R1.fastq.gz" $WORK_DIR/AGRF_CAGRF20114478_H3HGFDSX2_files.list | sort | parallel --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::" "rclone copy CloudStor:Shared/GRIFFITH-Ford-Crop-Genetics-Lab/A_rabiei_WGS/AGRF_CAGRF20114478_H3HGFDSX2/AGRF_CAGRF20114478/ --include \"{files}\" ./{sample} ; cd {sample}; if [[ \$( egrep '{sample}_H3H' $WORK_DIR/checksums.md5 | md5sum -c | cut -f2 -d ' ' | uniq) == 'OK' ]]; then >&2 echo 'Files passed md5 check'; else >&2 echo 'Files did not pass md5 check, please download again'; exit 99 ; fi ; fastp -i {} -I {file2} -c --adapter_fasta $BBMAP_REF -l 30 -p -w \$[NCPUS - 2] -z 7 -o {sample}_R1.trimmed.fastq.gz -O {sample}_R2.trimmed.fastq.gz -h {sample}.fastp.html -j {sample}.fastp.json; fastqc -t \$[NCPUS - 2] -o {sample}_QC {sample}*.trimmed.fastq.gz ; snippy --force --cleanup --cpus \$[NCPUS - 2] --ram $[RAM-2] --outdir snippy_{sample} --report --rgid {sample} --ref $GENOME.fna --R1 {sample}_R1.trimmed.fastq.gz --R2 {sample}_R2.trimmed.fastq.gz" > $RUN_DIR/$RUN.cmds
# On Griffith HPC: perform adapter trimming (fastp), mapping and variant calling (snippy); decide which intermediate files can be removed
grep "_R1.fastq.gz" $WORK_DIR/AGRF_CAGRF20114478_H3HGFDSX2_files.list | sort | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::" "mkdir -p ./{sample} ;cd {sample}; ln -sf $WORK_DIR/AGRF_CAGRF20114478/{files} ./; if [[ \$( egrep '{sample}_H3H' $WORK_DIR/checksums.md5 | md5sum -c | cut -f2 -d ' ' | uniq) == 'OK' ]]; then >&2 echo 'Files passed md5 check'; else >&2 echo 'Files did not pass md5 check, please download again'; exit 99 ; fi ; fastp -i {} -I {file2} -c --adapter_fasta $BBMAP_REF -l 30 -p -w \$[NCPUS - 2] -z 7 -o {sample}_R1.trimmed.fastq.gz -O {sample}_R2.trimmed.fastq.gz -h {sample}.fastp.html -j {sample}.fastp.json; snippy --force --cleanup --cpus \$[NCPUS - 2] --ram $[RAM-2] --outdir snippy_{sample} --report --rgid {sample} --ref $GENOME.fna --R1 {sample}_R1.trimmed.fastq.gz --R2 {sample}_R2.trimmed.fastq.gz " > $RUN_DIR/$RUN.cmds
# run just fastp trim
grep "_R1.fastq.gz" $WORK_DIR/AGRF_CAGRF20114478_H3HGFDSX2_files.list | sort | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::" "mkdir -p ./{sample} ;cd {sample}; ln -sf $WORK_DIR/AGRF_CAGRF20114478/{files} ./; if [[ \$( egrep '{sample}_H3H' $WORK_DIR/checksums.md5 | md5sum -c | cut -f2 -d ' ' | uniq) == 'OK' ]]; then >&2 echo 'Files passed md5 check'; else >&2 echo 'Files did not pass md5 check, please download again'; exit 99 ; fi ; fastp -i {} -I {file2} -c --adapter_fasta $BBMAP_REF -l 30 -p -w \$[NCPUS - 2] -z 7 -o {sample}_R1.trimmed.fastq.gz -O {sample}_R2.trimmed.fastq.gz -h {sample}.fastp.html -j {sample}.fastp.json" > $RUN_DIR/fastp_trim.cmds
# find . -maxdepth 1 -name "*_R1.fastq.gz" | parallel --dry-run --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::" "genozip {files} -f -@ \$[NCPUS - 2] -2 -E $GENOME.ref.genozip -o $RUN_DIR/CloudStor_copy/{sample}_H3HGFDSX2.fq.genozip; rm {files} trimmed_reads/{sample}_R?.recal.trimmed.fastq.gz" > genozip_clean.cmds
# GENOZIP_CLEAN=$(qsub -J1-$(cat $RUN_DIR/genozip_clean.cmds | wc -l) -l select=1:ncpus=6:mem=12GB,walltime=1:00:00 -N genozip_clean -A qris-gu -v "CMDS_FILE=$RUN_DIR/genozip_clean.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# recalibrate requires a reference
# bbduk.sh in=stdin.fq out=stdout.fq int recalibrate | filterbytile.sh in={infile} out=stdout.fq int=f | clumpify.sh in=stdin.fq out=stdout.fq int dedupe usetmpdir optical $CLUMP_PARAM adjacent
# add --cleanup to snippy
# make a folder for the log files
mkdir -p $RUN_DIR/pbs_logs
cd !$
# run fastp trim
FASTP_TRIM=$(qsub -J1-$(cat $RUN_DIR/fastp_trim.cmds | wc -l) -l select=1:ncpus=6:mem=8GB,walltime=1:00:00 -N fastp_trim -A qris-gu -v "CMDS_FILE=$RUN_DIR/fastp_trim.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# save failed commands to file
grep "job killed" fastp_trim.e$FASTP_TRIM.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN_DIR/fastp_trim.cmds > $RUN_DIR/fastp_trim.cmds.$FASTP_TRIM.failed
# rerun failed jobs
cd pbs_logs
FASTP_TRIM=$(qsub -J1-$(cat $RUN_DIR/fastp_trim.cmds.$FASTP_TRIM.failed | wc -l) -l select=1:ncpus=8:mem=4GB,walltime=2:00:00 -N fastp_trim -A qris-gu -v "CMDS_FILE=$RUN_DIR/fastp_trim.cmds.$FASTP_TRIM.failed","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# test run
TEST_SNIPPY=$(qsub -J195-196 -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=5:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/$RUN.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# run for all files
STEP_SIZE=10
TRIM_SNIPPY=$(qsub -J1-$(cat $RUN_DIR/$RUN.cmds | wc -l) -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=5:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/$RUN.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
#TRIM_SNIPPY2=$(qsub -J2-$(cat $RUN_DIR/$RUN.cmds | wc -l):$STEP_SIZE -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=1:00:00 -N ${RUN:0:11} -W depend=afterok:$TRIM_SNIPPY[] -A qris-gu -v "CMDS_FILE=$RUN_DIR/$RUN.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# check for successful commands and save them to a file
cd $RUN_DIR
egrep "ExitStatus:[0]" pbs_logs/*.e$TRIM_SNIPPY.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN.cmds > $RUN.$TRIM_SNIPPY.succeed
# check for failed commands and save them to a file
egrep "ExitStatus:[^0]" pbs_logs/*.e$TRIM_SNIPPY.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN.cmds > $RUN.$TRIM_SNIPPY.failed
# rerun failed commands (modify pbs file if needed to increase resources)
cd pbs_logs
TRIM_SNIPPY2=$(qsub -J1-$(cat $RUN_DIR/$RUN.$TRIM_SNIPPY.failed | wc -l) -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=5:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/$RUN.$TRIM_SNIPPY.failed","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# repeat as needed
egrep "ExitStatus:[^0]" pbs_logs/*.e$TRIM_SNIPPY2.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN.$TRIM_SNIPPY.failed > $RUN.$TRIM_SNIPPY2.failed
# rerun failed jobs
cd pbs_logs
TRIM_SNIPPY3=$(qsub -J1-$(cat $RUN_DIR/$RUN.$TRIM_SNIPPY2.failed | wc -l) -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=5:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/$RUN.$TRIM_SNIPPY2.failed","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# repeat as needed
egrep "ExitStatus:[^0]" pbs_logs/*.e$TRIM_SNIPPY3.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN.$TRIM_SNIPPY2.failed > $RUN.$TRIM_SNIPPY3.failed
# rerun failed jobs
cd pbs_logs
TRIM_SNIPPY_NEW=$(qsub -J1-$(cat $RUN_DIR/$NEW_RUN.bash | wc -l) -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=7:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/$NEW_RUN.bash","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# find incomplete jobs
cd $RUN_DIR
find . -name "snps.aligned.fa" | cut -f 3 -d / | gawk '
{printf "%s|", $1}' | sed 's/|$//' | egrep -vf - $RUN_DIR/$RUN.cmds > $RUN_DIR/$RUN.cmds.failed
# rerun failed jobs
cd pbs_logs
TRIM_SNIPPY3=$(qsub -J1-$(cat $RUN_DIR/$RUN.cmds.failed | wc -l) -l select=1:ncpus=${NCORES}:mem=32GB,walltime=2:00:00 -N ${RUN:0:11} -v "CMDS_FILE=$RUN_DIR/$RUN.cmds.failed","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# find duplicate sample names that didn't work initially
egrep "AR0052_H3H|AR0039_H3H|AR0242_H3H" $RUN_DIR/$RUN.cmds > $RUN_DIR/$RUN.cmds.retry
TRIM_SNIPPY_RETRY=$(qsub -J1-$(cat $RUN_DIR/$RUN.cmds.retry | wc -l) -l select=1:ncpus=${NCORES}:mem=32GB,walltime=2:00:00 -N ${RUN:0:11} -v "CMDS_FILE=$RUN_DIR/$RUN.cmds.retry","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# remove empty log files
find . -type f -size 0 -name "${RUN:0:11}.*" -exec rm {} +
# Run snippy-core on the output files
cd $RUN_DIR
CORE_JOB=snippy_core_"$(date +%d_%m_%Y)"
SNPY_FOLDS=$( find . -type d -name "snippy_AR*" | xargs )
SNPY_REF="'$( find . -type d -name "snippy_AR*" | head -n1 )/reference/ref.fa'"
# CORE_CMD=$( tail -n1 ${RUN}.bash )
SNPY_CORE=$( echo "cd $( pwd ) ; source ~/.bashrc ; conda activate $CONDA_PREFIX ; snippy-core --prefix=CORE_JOB --ref $SNPY_REF $SNPY_FOLDS " | qsub -V -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=2:00:00 -N ${CORE_JOB:0:11} | egrep -o "[0-9]{7}" ) # 5248661.pbsserver
# cd $RUN_DIR
# find . -maxdepth 1 -type d -name "AR*" | cut -f2 -d"/" | parallel --dry-run "mkdir -p {}/{}_QC; cd {}; fastp -i {}*_R1.trimmed.fastq.gz -I {}*_R2.trimmed.fastq.gz -w $[NCPUS - 2] -h {}_QC/{}.fastp.html -j {}_QC/{}.fastp.json -A -Q -G -m --stdout" > fastp_trim_qc.cmds
# FASTP_QC=$(qsub -J1-$(cat $RUN_DIR/fastp_trim_qc.cmds | wc -l) -l select=1:ncpus=${NCORES}:mem=${RAM}GB,walltime=5:00:00 -N ${RUN:0:11} -A qris-gu -v "CMDS_FILE=$RUN_DIR/fastp_trim_qc.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# record the array ID: 5247580[] Macrogen batch on Gowonda
# multiqc report
MULTIQC_JOB="${RUN}_QC"
echo "cd $( pwd ) ; multiqc -i $MULTIQC_JOB -o $MULTIQC_JOB ." | qsub -V -l select=1:ncpus=12:mem=4GB,walltime=1:00:00 -N ${MULTIQC_JOB:0:11} -W depend=afterok:$QUALIMAP_JOB_ID # 5248672.pbsserver
# collect trimming summary
grep "Result:" ${RUN}.log | gawk 'NR%2==0' | gawk 'BEGIN{per=0}{per+=gensub(/.([0-9]+\.[0-9]+).+/, "\\1", "1", $7); avg=per/NR}END{printf("Average percentage of bases kept after trimming: %.2f%% \n",avg)}' > $RUN.bbduk.stats
# Average percentage of bases kept after trimming: 97.76%
# Number of files processed: 72
grep "Input:" ${RUN}.log | gawk 'NR%2==0' | gawk 'BEGIN{reads=0};{reads+=$2; avg=reads/NR}END{printf("Average number of reads per file: %.2f \nTotal number of reads: %d\nNumber of files processed: %d\n",avg, reads,NR)}' >> $RUN.bbduk.stats
# Average number of reads per file: 4243748.83
# Total number of reads: 305549916
# find and remove empty files
find . -size 0 -exec rm {} +
# Check that all jobs finished successfuly
find . -regextype posix-egrep -regex '\./.*\.e[0-9]{7}.*' | xargs grep "ExitStatus" # *m.e$JOB_ID.*
# remove temp files
rm raw.sam raw_R*.fq.gz
# Done!Reads are mapped to the A. rabiei reference genome assembled and annotated by Dr. Fredrick Mobegi (the CCDM, Curtin University, add NCBI accession when available) using bwa-mem2 v2.2.1 (Vasimuddin et al. 2019). The alignment files were then coordinate-sorted and had PCR duplicates marked using bamsormadup from biobambam2 v2.0.182 (Tischler and Leonard 2014).
To improve performance, temporary files were written to a local lscratch folder on the computing node.
REF_DIR=$HOME/data/reference_genomes/Ascochyta_reference_genomes/Updated_ME14_reference_and_annotations # on Griffith HPC
# REF_DIR=$HOME/data/reference_genomes # on Awoonga HPC
GENOME="$REF_DIR/ArME14"
ln -s $REF_DIR/Ascochyta_rabiei_ArME14.scaffolds.fa $GENOME.fa
ln -s $REF_DIR/ArME14_all_annotations.updated.gff3 $GENOME.gff3
gff2bed < $GENOME.gff3 > $GENOME.bed
bwa-mem2 index $GENOME.fa
samtools faidx $GENOME.faPrepare and run the pipeline to trim the reads
WORK_DIR=$HOME/scratch/A_rabiei_WGS_2021
CONDA_NAME=/export/home/s2978925/miniconda3/envs/snippy
conda activate $CONDA_NAME
BBMAP_REF=$(find $CONDA_PREFIX -wholename "*resources/adapters.fa")
# Prepare the commands
DATE=`date +%d_%m_%Y`
BATCH=All_2018_2020
RUN="${BATCH}_${DATE}" # day of run was 30_09_2021
RUN_DIR=$WORK_DIR/${RUN}
mkdir -p $RUN_DIR
cd $RUN_DIR
# GENOME="$HOME/scratch/data/reference_genomes/Ascochyta_reference_genomes/A.rabiei_me14"
# READ_LENGTH=150
# PLOIDY=1
NCORES=12
RAM=28
# LANES=$( ls -1 ../*.fastq.gz | egrep -o "_L[0-9]+?_" | sort | uniq | wc -l )
RGPM=NovaSeq
RGPL=ILLUMINA
# RGPU=H3HGFDSX2
# RGCN=AGRF
# COMMON_ID="DO16094321_HFKYMALXX"
# where to store read files
# FQ_DIR="$RUN_DIR/fastq_files"
# mkdir -p $RUN_DIR/CloudStor_copy
# Prepare a general array PBS script
echo '#!/bin/bash
#PBS -V
cd $WORKDIR
>&2 echo Current working directory: $PWD
source ~/.bashrc'"
conda activate $CONDA_PREFIX
set -Eeo pipefail
gawk -v ARRAY_IND=\$PBS_ARRAY_INDEX 'NR==ARRAY_IND' \$CMDS_FILE | bash" > ${WORK_DIR}/array.pbspro
# Prepare a parallel array PBS script
echo '
ARRAYID=$( echo $PBS_JOBID | egrep -o "^[0-9]+" )
gawk -v ARRAY_IND=$PBS_ARRAY_INDEX -v step=$STEP_SIZE '"'NR>=ARRAY_IND && NR<(ARRAY_IND+step)'"' $CMDS_FILE | parallel -k --joblog $PBS_O_WORKDIR/${PBS_JOBNAME}.p$ARRAYID.$PBS_ARRAY_INDEX' | cat <(head -n -1 ${WORK_DIR}/array.pbspro) - > ${WORK_DIR}/parallel_array.pbspro
# On Griffith HPC:
# link all fastq files and rename them the same way
# AgVic batch
ln -s $WORK_DIR/AgVic_WGS2018/*.fastq.gz ./
rename -v 's/S[0-9]+_L003_//' *.fastq.gz
# AGRF 2019
ln -s $WORK_DIR/AGRF_CAGRF19461/*.fastq.gz ./
# Macrogen
ln -s $WORK_DIR/Macrogen_sequences_1702KHP-0164/*.fastq.gz ./
# 2020 AGRF
ln -s $WORK_DIR/AGRF_CAGRF20114478/*.fastq.gz ./
rename -v 's/H3HGFDSX2_[ACGT-]+_L002_//' *.fastq.gz
# ignore contaminated/shallow files
# IGNORE_SAMS='11_R|5B_R|2_R|11_R|15CUR005_R|AGRF_18_R|AGRF_08_R|F15023_R|20_R'
printf "AR0106\nAR0107\nAR0113\nAR0118\n11\n5B\n2\n11\n15CUR005\nAGRF_18\nAGRF_08\nF15023\n20" > ignore_samples.txt
IGNORE_RE=$(awk '{printf "^%s_R|",$1}' ignore_samples.txt | sed 's/|$//')
# perform adapter trimming (fastp)
ls -1 *_R1.fastq.gz | egrep -v $IGNORE_RE | sort | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_R1.+::" "mkdir -p ./{sample} ;cd {sample}; ln -sf $RUN_DIR/{files} ./; fastp -i {} -I {file2} -c --adapter_fasta $BBMAP_REF -l 30 -p -w \$[NCPUS - 2] -z 7 -o {sample}_R1.trimmed.fastq.gz -O {sample}_R2.trimmed.fastq.gz -h {sample}.fastp.html -j {sample}.fastp.json; " > $RUN_DIR/fastp_trim.cmds
# find . -maxdepth 1 -name "*_R1.fastq.gz" | parallel --dry-run --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_H3HGFDSX2_[ACGT\-]+_L002_.+::" "genozip {files} -f -@ \$[NCPUS - 2] -2 -E $GENOME.ref.genozip -o $RUN_DIR/CloudStor_copy/{sample}_H3HGFDSX2.fq.genozip; rm {files} trimmed_reads/{sample}_R?.recal.trimmed.fastq.gz" > genozip_clean.cmds
# GENOZIP_CLEAN=$(qsub -J1-$(cat $RUN_DIR/genozip_clean.cmds | wc -l) -l select=1:ncpus=6:mem=12GB,walltime=1:00:00 -N genozip_clean -A qris-gu -v "CMDS_FILE=$RUN_DIR/genozip_clean.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# recalibrate requires a reference
# bbduk.sh in=stdin.fq out=stdout.fq int recalibrate | filterbytile.sh in={infile} out=stdout.fq int=f | clumpify.sh in=stdin.fq out=stdout.fq int dedupe usetmpdir optical $CLUMP_PARAM adjacent
# add --cleanup to snippy
# make a folder for the log files
mkdir -p $RUN_DIR/pbs_logs
cd !$
# run fastp trim
FASTP_TRIM=$(qsub -J1-$(cat $RUN_DIR/fastp_trim.cmds | wc -l) -l select=1:ncpus=8:mem=4GB,walltime=3:00:00 -N fastp_trim -v "CMDS_FILE=$RUN_DIR/fastp_trim.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# combine stderr files to log
cat *.e$FASTP_TRIM.* > fastp_trim.$FASTP_TRIM.log
# save failed commands to file
grep "job killed" fastp_trim.e$FASTP_TRIM.* | awk -F [.:] '{print $3}' | awk 'NR==FNR{pos[$1]; next}FNR in pos' - $RUN_DIR/fastp_trim.cmds > $RUN_DIR/fastp_trim.cmds.$FASTP_TRIM.failed
# rerun failed jobs
# align the trimmed reads to the genome
cd $RUN_DIR
ls -1 *_R1.fastq.gz | egrep -v $IGNORE_RE | sort | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_R1.+::" "cd {sample}; bwa-mem2 mem -R \"@RG\tID:{sample}\tSM:{sample}\" -t \$[NCPUS - 2] $GENOME.fa {sample}_R1.trimmed.fastq.gz {sample}_R2.trimmed.fastq.gz | bamsormadup inputformat=sam threads=\$[NCPUS - 2] > {sample}.dedup.rg.csorted.bam" > $RUN_DIR/align_reads.cmds
# test run
TEST_BWA=$(qsub -J1-2 -l select=1:ncpus=8:mem=8GB,walltime=2:00:00 -N align_reads -v "CMDS_FILE=$RUN_DIR/align_reads.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# run for all files
ALIGN_BWA=$(qsub -J3-$(cat $RUN_DIR/align_reads.cmds | wc -l) -l select=1:ncpus=8:mem=8GB,walltime=2:00:00 -N align_reads -v "CMDS_FILE=$RUN_DIR/align_reads.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
STEP_SIZE=5 # the processing is very quick (<20 minutes) and it would be more efficient for each job to process 5 commands at a time
IGNORE_SAMS=$(awk '{printf "^%s\t|",$1}' $RUN_DIR/ignore_samples.txt | sed 's/|$//')
# merge sam files of duplicated samples and rename to isolate name
gawk -F "\t" '$3 ~ /Ascochyta rabiei/ && $6 ~ /Illumina/' $WORK_DIR/WGS_sequencing_information_summary.txt | egrep -v "$IGNORE_SAMS" | tail -n+2 | sort -k2 | gawk -F "\t" -v ORS="" 'BEGIN{cmd="picard MergeSamFiles USE_THREADING=true QUIET=true COMPRESSION_LEVEL=0 O=/dev/stdout"}NR==1{printf "%s I=%s/%s.dedup.rg.csorted.bam", cmd, $1, $1; isolate=$2; split($6, a , " "); next}isolate!=$2{printf " | picard AddOrReplaceReadGroups SO=coordinate I=/dev/stdin CREATE_INDEX=true ID=%s LB=%s PL=Illumina PU=%s SM=%s O=%s.csorted.bam\n%s I=%s/%s.dedup.rg.csorted.bam", isolate, isolate, a[2], isolate, isolate, cmd, $1, $1; isolate=$2; split($6, a , " "); next}isolate==$2{printf " I=%s/%s.dedup.rg.csorted.bam", $1, $1}END{printf " | picard AddOrReplaceReadGroups SO=coordinate I=/dev/stdin CREATE_INDEX=true ID=%s LB=%s PL=Illumina PU=%s SM=%s O=%s.csorted.bam\n", isolate, isolate, a[2], isolate, isolate}' > $RUN_DIR/merge_bams.cmds
# test run (if not fast enough replace with sambamba merge and samtools addreplacerg )
TEST_MERGE=$(qsub -J1-2 -l select=1:ncpus=8:mem=16GB,walltime=2:00:00 -N merge_bams -v "CMDS_FILE=$RUN_DIR/merge_bams.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# run for all files
MERGE_BAMS=$(qsub -J3-$(cat $RUN_DIR/merge_bams.cmds | wc -l) -l select=1:ncpus=8:mem=16GB,walltime=1:30:00 -N merge_bams -v "CMDS_FILE=$RUN_DIR/merge_bams.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# find failed runs and repeat
cd $RUN_DIR
find . -maxdepth 1 -name "*.csorted.bam" -size 1M | cut -f 2 -d "/" | grep -f - merge_bams.cmds > merge_bams.cmds.failed
# repeat run
echo "cd $RUN_DIR ; source ~/.bashrc; conda activate $CONDA_NAME; cat merge_bams.cmds.failed | parallel" | qsub -l select=1:ncpus=8:mem=16GB,walltime=2:00:00 -N merge_bams
# index BAMs (if not done previously while adding Read Groups)
BAM_INDEX=$( echo "cd $RUN_DIR ; source ~/.bashrc; conda activate $CONDA_NAME; ls -1 *.csorted.bam | parallel 'sambamba index -t4 {} '" | qsub -V -l select=1:ncpus=16:mem=16GB,walltime=2:00:00 -N index_bams | egrep -o "^[0-9]+")
# Run qualimap on bam files (files need to be sorted)
# QUALIMAP_JOB="qualimap_${RUN}"
# Qualimap bamqc
cd $RUN_DIR
ls -1 *_R1.fastq.gz | egrep -v $IGNORE_RE | sort | parallel -k --dry-run --rpl "{file2} s:_R1:_R2:; uq()" --rpl "{files} s:_R1:_R?:; uq()" --rpl "{sample} s:_R1.+::" "unset DISPLAY; cd $RUN_DIR/{sample}; qualimap bamqc -bam {sample}.dedup.rg.csorted.bam --java-mem-size=4G -c -gff $GENOME.bed -outdir $RUN_DIR/{sample}/{sample}_bamqc" > qualimap_bamqc.cmds
cd $RUN_DIR/pbs_logs
# submit to the cluster
STEP_SIZE=5 # the processing is very quick (<20 minutes) and it would be more efficient for each job to process 5 commands at a time
QUALIMAP=$(qsub -J1-$(cat $RUN_DIR/qualimap_bamqc.cmds | wc -l):$STEP_SIZE -l select=1:ncpus=8:mem=16GB,walltime=2:00:00 -N qualimap_bamqc -v "CMDS_FILE=$RUN_DIR/qualimap_bamqc.cmds","WORKDIR=$RUN_DIR","STEP_SIZE=$STEP_SIZE" ${WORK_DIR}/parallel_array.pbspro | egrep -o "^[0-9]+")
# Multi-bamqc: create/copy a file describing the sample names (sample_info.txt) to the parent folder and use it in qualimap
find `pwd` -name "*_bamqc" | gawk '{sample_name=gensub(/.+\/(.+)_bamqc/, "\\1", $1); printf "%s\t%s\n", sample_name, $1}' > multibamqc.samples
QUALIMAP_MULTI=$( echo "cd $RUN_DIR ; source ~/.bashrc; conda activate $CONDA_NAME; unset DISPLAY ; find `pwd` -name \"*_bamqc\" | gawk '{sample_name=gensub(/.+\/(.+)_bamqc/, \"\\\1\", \$1); printf \"%s\t%s\n\", sample_name, \$1}' > multibamqc.samples ; qualimap multi-bamqc --java-mem-size=4G -d multibamqc.samples -outformat PDF:HTML -outdir ${BATCH}_bamqc " | qsub -V -l select=1:ncpus=${NCORES}:mem=32GB,walltime=5:00:00 -N ${QUALIMAP_JOB:0:11} | egrep -o "^[0-9]+")
# remove empty log files
find . -type f -size 0 -name "${RUN:0:11}.*" -exec rm {} +
# multiqc report
MULTIQC_JOB=QC_$(date +%d_%m_%Y)
echo "source ~/.bashrc; conda activate $CONDA_NAME; cd $RUN_DIR; set -Eeo pipefail; multiqc -i $MULTIQC_JOB -o $MULTIQC_JOB ." | qsub -V -l select=1:ncpus=12:mem=4GB,walltime=2:00:00 -N ${MULTIQC_JOB:0:11} # 5248672.pbsserver
# collect trimming summary
grep "Result:" ${RUN}.log | gawk 'NR%2==0' | gawk 'BEGIN{per=0}{per+=gensub(/.([0-9]+\.[0-9]+).+/, "\\1", "1", $7); avg=per/NR}END{printf("Average percentage of bases kept after trimming: %.2f%% \n",avg)}' > $RUN.bbduk.stats
# Average percentage of bases kept after trimming: 97.76%
# Number of files processed: 72
grep "Input:" ${RUN}.log | gawk 'NR%2==0' | gawk 'BEGIN{reads=0};{reads+=$2; avg=reads/NR}END{printf("Average number of reads per file: %.2f \nTotal number of reads: %d\nNumber of files processed: %d\n",avg, reads,NR)}' >> $RUN.bbduk.stats
# Average number of reads per file: 4243748.83
# Total number of reads: 305549916
# find and remove empty files
find . -size 0 -exec rm {} +
# Check that all jobs finished successfully
find . -regextype posix-egrep -regex '\./.*\.e[0-9]+.*' | xargs grep "ExitStatus" # *m.e$JOB_ID.*
# remove temp files
rm raw.sam raw_R*.fq.gz
# Done!Error rates were estimated by individually processing the duplicate samples and assessing the rate of mismatch variant calls within replicates of the same isolate.
To process mapping table, copy the multisampleBamQcReport.html file from the server to the local analysis folder
Samples with less than 20% mapping and x15 coverage were removed from the rest of the analysis (see highlighted rows in Table 2 and more details in the MultiQC report).
PLOIDY=1
DATE=`date +%d_%m_%Y`
RUN=FB_all_${DATE} # day of run was 02_02_2019
mkdir -p $WORK_DIR/${RUN}
cd !$
# JOB_NAME="BBmap_me14_vars"
GENOME="$HOME/scratch/data/reference_genomes/Ascochyta_reference_genomes/A.rabiei_me14"
NCORES=8
# Distributed freebayes (each node runs freebayes-parallel on one contig)
# download script
wget https://raw.githubusercontent.com/freebayes/freebayes/master/scripts/split_ref_by_bai_datasize.py ~/bin/
chmod +x ~/bin/split_ref_by_bai_datasize.py
conda install -n $CONDA_NAME numpy scipy
# split each contig/chromosome to smaller 1e6 bits
~/bin/split_ref_by_bai_datasize.py -s 4e5 -r $GENOME.fa.fai $RUN_DIR/AR0321.csorted.bam > $RUN_DIR/ArME14_target_regions_chr.bed
ln -s $RUN_DIR/ArME14_target_regions_chr.tsv $RUN_DIR/ArME14_target_regions_chr.bed
# prepare commands
BAM_FILES=$( find $RUN_DIR -maxdepth 1 -name "*.csorted.bam" -size +1M | gawk -vORS=" " '1' )
cut -f1 $GENOME.fa.fai | parallel --dry-run "freebayes-parallel <(grep '{}' ArME14_target_regions_chr.tsv | gawk '{printf \"%s:%s-%s\n\", \$1, \$2, \$3}') \$NCPUS -f $GENOME.fa -p1 $BAM_FILES > {}.combined.vcf" > freebayes.cmds
cd $RUN_DIR/pbs_logs
# send to the cluster
FREEBAYES=$(qsub -J1-$(cat $RUN_DIR/freebayes.cmds | wc -l) -l select=1:ncpus=8:mem=64GB,walltime=24:00:00 -N freebayes -v "CMDS_FILE=$RUN_DIR/freebayes.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# check status of jobs
seq 1 $(cat $RUN_DIR/freebayes.cmds | wc -l) | parallel "qstat -fx $FREEBAYES[{}] | head"
# repeat failed jobs
FREEBAYES_FAILED=$(qsub -J7-17:10 -l select=1:ncpus=8:mem=64GB,walltime=36:00:00 -N freebayes -v "CMDS_FILE=$RUN_DIR/freebayes.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# merge ind files to a single bam file (keep submission ids)
BAMS=$( find $RUN_DIR -maxdepth 1 -name "*.csorted.bam" -size +1M | gawk -v ORS=" " '{print "I="$1}' )
# BAM_FILES=$( find $RUN_DIR -maxdepth 1 -name "*.csorted.bam" -size +1M | gawk -vORS=" " '1' )
# cd $RUN_DIR/pbs_logs
# merge with sambamba (memory allocation issues)
# FINAL_MERGE=$( echo "source ~/.bashrc; conda activate $CONDA_NAME; cd $RUN_DIR; set -Eeo pipefail; sambamba merge -t $[NCPUS - 2] All_samples.csorted.combined.bam $BAM_FILES; sambamba depth base --combined All_samples.csorted.combined.bam | cut -f 1-3 | tail -n +2 | coverage_to_regions.py $GENOME.fa.fai 50 > ArME14_targets.regions " | qsub -V -l select=1:ncpus=${NCORES}:mem=32GB,walltime=2:00:00 -N Final_merge | egrep -o "^[0-9]+") # 5248672.pbsserver
# merge with picard (not needed)
FINAL_MERGE=$( echo "source ~/.bashrc; conda activate $CONDA_NAME; cd $RUN_DIR; set -Eeo pipefail; picard MergeSamFiles CREATE_INDEX=true R=$GENOME.fa SO=coordinate $BAMS O=All_samples.csorted.combined.bam" | qsub -V -l select=1:ncpus=${NCORES}:mem=32GB,walltime=3:00:00 -N Final_merge | egrep -o "^[0-9]+")
# calculate coverage
ln -s $RUN_DIR/All_samples.csorted.combined.bai $RUN_DIR/All_samples.csorted.combined.bam.bai
# conda deactivate
~/bin/split_ref_by_bai_datasize.py -s 100e6 -r $GENOME.fa.fai $RUN_DIR/All_samples.csorted.combined.bam > $RUN_DIR/All_ArME14_targets_regions.bed
# run freebayes on combined BAM (stringent)
cut -f1 $GENOME.fa.fai | parallel --dry-run "freebayes-parallel <(grep '{}' $RUN_DIR/All_ArME14_targets_regions.bed | gawk '{printf \"%s:%s-%s\n\", \$1, \$2, \$3}') \$NCPUS -f $GENOME.fa -p1 -g 100000 -0 -C 5 $RUN_DIR/All_samples.csorted.combined.bam > $RUN_DIR/All_samples.{}.combined.stringent.vcf" > $RUN_DIR/freebayes_stringent.cmds
# run freebayes on combined BAM
FREEBAYES_STRING=$(qsub -J1-$(cat $RUN_DIR/freebayes_stringent.cmds | wc -l) -l select=1:ncpus=8:mem=64GB,walltime=36:00:00 -N freebayes_string -v "CMDS_FILE=$RUN_DIR/freebayes_stringent.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
# save read groups to file
sambamba view -H All_samples.csorted.combined.bam | grep '^@RG' > read_groups.txt
# filter the files and combine to a single VCF file,
CONCAT_FB=$(echo "set -Eeo pipefail; source ~/.bashrc; conda activate $CONDA_NAME; cd $RUN_DIR ; cat $RUN_DIR/All_samples.*.combined.stringent.vcf | vcffirstheader | vcfstreamsort -w 1000 | vcfuniq > A_rabiei_2018_2020.bwa.fb.stringent.vcf" | qsub -V -l select=1:ncpus=8:mem=16GB,walltime=20:00:00 -N FB_concat | egrep -o "[0-9]+" )
MIN_DP=5
ls -1 ArME14_*.combined.vcf | parallel --dry-run "printf \"{}\t%s\t%s\t%s\t%s\n\" \$(cat {} | grep -c -v '^#') \$(cat {} | SnpSift filter \"( GEN[?].DP > $MIN_DP ) & ( GEN[?].GT != './.' )\" | gawk '{if (\$0 ~ /^#/ || (length(\$4)==1 && length(\$5)==1)); print \$0}' | grep -c -v '^#') \$(cat {} | SnpSift filter \"( GEN[?].DP > $MIN_DP ) & ( GEN[?].GT != './.' ) & ( QUAL > 20 )\" | gawk '{if (\$0 ~ /^#/ || (length(\$4)==1 && length(\$5)==1)); print \$0}' | grep -c -v '^#') \$(cat {} | SnpSift filter \"( GEN[?].DP > $MIN_DP ) & ( GEN[?].GT != './.' ) & ( QUAL > 30 )\" | gawk '{if (\$0 ~ /^#/ || (length(\$4)==1 && length(\$5)==1)); print \$0}' | grep -c -v '^#') > {}.stats" > $RUN_DIR/vcf_stats.cmds
cd $RUN_DIR/pbs_logs
# Calculate contig stats
VCF_STATS=$(qsub -J1-$(cat $RUN_DIR/vcf_stats.cmds | wc -l) -l select=1:ncpus=8:mem=16GB,walltime=2:00:00 -N vcf_stats -v"CMDS_FILE=$RUN_DIR/vcf_stats.cmds","WORKDIR=$RUN_DIR" ${WORK_DIR}/array.pbspro | egrep -o "^[0-9]+")
cat <(printf "file\ttotal_snps\tDP${MIN_DP}_filtered_snps\tQUAL20_filtered_snps\tQUAL30_filtered_snps\n") <(cat $RUN_DIR/*stringent.vcf.stats) > vcf_stats_stringent_$DATE.txt
# find and remove empty files
find . -size 0 -exec rm {} +
# check jobs completion
find . -regextype posix-egrep -regex '\./.*\.e[0-9]{7}.*' | xargs grep "ExitStatus"
# check for success/failure of FB commands
tail -n +2 $FB.cmds.p*.* | cut -f7,9 | gawk '$1==1' > $FB.failed.cmds
tail -n +2 $FB.cmds.p*.* | cut -f7,9 | gawk '$1==0' > $FB.successful.cmds
rm $FB.cmds.p*.*The Illumina sequencing data was also used to identify regions that vary in their copy number in the genomes of each isolate. An overview and comparison of copy number variation (CNV) calling approaches can be found in (Gabrielaite et al. 2021; Coutelier et al. 2022).
We performed CNV analysis using several methods:
Hecaton (a modified workflow based on v0.5.1) is a computational workflow that integrates the results of different algorithms that utilise the sequencing depth and break points (paired reads that don’t overlap properly) at a region, as well as the allelic ratio (if a region appears to be heterozygote it must be duplicated in a haploid genome) (Wijfjes et al. 2019).
# prepare Hecaton
source ~/.proxy
cd ~/.nextflow/assets/IdoBar
git clone https://github.com/IdoBar/hecaton.git
WORK_DIR=$HOME/scratch/A_rabiei_WGS_2021
GENOME_DIR="$HOME/data/reference_genomes/Ascochyta_reference_genomes/ArME14_v2"
GENOME="ArME14.v2.fasta"
HECATON_HOME="$HOME/.nextflow/assets/IdoBar/hecaton"
FQ_DIR="$WORK_DIR/All_2018_2020_30_09_2021"
DATE=`date +%d_%m_%Y`
RUN=Hecaton_all_${DATE} # day of run was 05_07_2021
mkdir -p $WORK_DIR/${RUN}
cd !$
# prepare reference genome
conda activate aDNA
ln -s $GENOME_DIR/$GENOME ./
bash $HECATON_HOME/bash/preprocess.sh $GENOME
picard CreateSequenceDictionary R=$GENOME O=$GENOME.dict
conda deactivate
# prepare read files
mkdir fastq_dir
printf "AR0106\nAR0107\nAR0113\nAR0118\n11\n5B\n2\n15CUR005\nAGRF_18\nAGRF_08\nF15023\n20" > ignore_samples.txt
IGNORE_RE=$(awk '{printf "/%s_R|",$1}' ignore_samples.txt | sed 's/|$//')
# perform adapter trimming (fastp)
ls -1 $FQ_DIR/*.fastq.gz | egrep -v $IGNORE_RE | sort | parallel -k --dry-run ln -s {} fastq_dir/
# run Hecaton
conda activate nf
nextflow run -with-singularity ~/.singularity/cache/hecaton_latest.sif -c ~/.nextflow/GU.config -profile gowonda -w hecaton_workdir $HOME/.nextflow/assets/IdoBar/hecaton/nextflow/hecaton.nf --genome_file $GENOME --reads "fastq_dir/*_R{1,2}.fastq.gz" --manta_config $HECATON_HOME/docker/configManta_weight_1.py.ini --output_dir hecaton_output --model_file $HECATON_HOME/models/random_forest_model_concat_A_thaliana_ColxCvi_O_sativa_Suijing18_coverage_10x_insertions_balanced_subsample.pkl -resume
# add read depth and convert to vcf
# prepare input file
find `pwd`/hecaton_output/aligned_reads/*.bam | egrep -v "discordants|splitters" | sort | gawk '{sample_name=gensub(/.+\/(.+).bam/, "\\1", $1); printf "%s\t%s\n", sample_name, $1}' > bam_inputs.tsv
find `pwd`/hecaton_output/random_forest_calls -name "*unfiltered.bedpe" | sort | gawk '{sample_name=gensub(/.+\/(.+)_probabilities.+.bedpe/, "\\1", $1); printf "%s\t%s\n", sample_name, $1}' > unfiltered.bedpe.files
# join the tables
# awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' OFS='\t' fileb filea
# awk 'NR==FNR{a[$1]=$3; next} ($7 in a) {print $0, a[$7]}' TableB TableA > TableC
join -1 1 -t $'\t' -o 1.1,1.2,2.2 <(sort bam_inputs.tsv -k1) <(sort unfiltered.bedpe.files -k1) > bed2vcf_input_files.tsv
# run the pipeline and convert BEDPE files to VCFs
nextflow run -with-singularity ~/.singularity/cache/hecaton_latest.sif -c ~/.nextflow/GU.config -profile gowonda $HOME/.nextflow/assets/IdoBar/hecaton/nextflow/add_read_depth.nf --input bed2vcf_input_files.tsv --fasta $GENOME --outdir duphold_unfiltered_output --merge -resume
# Combine/merge the VCF files
# "fix" merge vcf script
ln -s $HOME/conda3/envs/hecaton/lib/libcrypto.so.1.1 $HOME/conda3/envs/hecaton/lib/libcrypto.so.1.0.0
ln -s $HOME/.nextflow/assets/IdoBar/hecaton/scripts/utils $HOME/.nextflow/assets/IdoBar/hecaton/scripts/genotype/
# create the input file
ls -1 *.vcf.gz > hecaton_vcf_files.txt
# send the job to the cluster
HEC_MERGE_JOB=$( echo "source ~/.bashrc; conda activate hecaton; cd $PWD; set -Eeo pipefail; $HOME/.nextflow/assets/IdoBar/hecaton/scripts/genotype/merge_vcf_files.py -i hecaton_vcf_files.txt -f $HOME/scratch/A_rabiei_WGS_2021/Hecaton_all_05_07_2022/$GENOME.fai -o A_rabiei_merged_samples.vcf -r 0.5 " | qsub -V -l select=1:ncpus=8:mem=4GB,walltime=1:00:00 -N cn_MOPS | egrep -o "^[0-9]+")
# run for the 2020 samples only
ls -1 AR0*.vcf.gz > AR_2020_vcfs.txt
# send the job to the cluster
HEC_MERGE_JOB=$( echo "source ~/.bashrc; conda activate hecaton; cd $PWD; set -Eeo pipefail; $HOME/.nextflow/assets/IdoBar/hecaton/scripts/genotype/merge_vcf_files.py -i AR_2020_vcfs.txt -f $HOME/scratch/A_rabiei_WGS_2021/Hecaton_all_05_07_2022/$GENOME.fai -o A_rabiei_2020_merged_samples.vcf -r 0.5 " | qsub -V -l select=1:ncpus=8:mem=4GB,walltime=1:00:00 -N cn_MOPS | egrep -o "^[0-9]+")cn.MOPS (v1.42.0) is a Bioconductor R package that models the depth of coverage distribution across samples in each genomic loci and applies a Bayesian approach to infer integer copy number for each location (Klambauer et al. 2012).
See the cn.mops.R Gist for an example of the workflow.
# setup R environment on the cluster
CONDA_NAME=snippy
conda activate $CONDA_NAME
mamba install bioconductor-cn.mops r-biocmanager r-tidyverse r-optparse
# install missing package
Rscript "BiocManager::install('GenomeInfoDbData')"
# prepare the work folder
WORK_DIR=$HOME/scratch/A_rabiei_WGS_2021
BAM_DIR="$WORK_DIR/All_2018_2020_30_09_2021"
PLOIDY=1
DATE=`date +%d_%m_%Y`
RUN=cnMOPS_all_${DATE} # day of run was 06_07_2021
mkdir -p $WORK_DIR/${RUN}
cd !$
# JOB_NAME="BBmap_me14_vars"
NCORES=8
mkdir bam_files
ln -s $BAM_DIR/*.csorted.bam* bam_files/
# download the PBS Rscript
wget https://gist.githubusercontent.com/IdoBar/0d39c714469219abf7932d2c1666ed25/raw/64e3057703e7c1a4a12bcc6d619d56a9308dcf73/cn.mops.R
# send the job to the cluster
CNMOPS_JOB=$( echo "source ~/.bashrc; conda activate $CONDA_NAME; cd $PWD; set -Eeo pipefail; Rscript cn.mops.R -i bam_files -o output " | qsub -V -l select=1:ncpus=${NCORES}:mem=32GB,walltime=5:00:00 -N cn_MOPS | egrep -o "^[0-9]+")Alternatively, cn.MOPS can be run on the HPC by starting R and running the following:
library(cn.mops)
library(tidyverse)
library(optparse)
option_list = list(
make_option(c("-i", "--input_dir"), type="character", default=NULL,
help="input folder containing bams files", metavar="character"),
make_option(c("-o", "--out_dir"), type="character", default="output",
help="output folder name [default= %default]", metavar="character"),
make_option(c("-t", "--threads"), type="character", default=as.integer(Sys.getenv("NCPUS")),
help="number of parallel threads to use [default= %default]", metavar="integer")
)
opt_parser = OptionParser(option_list=option_list)
opt = parse_args(opt_parser)
if (is.null(opt$input_dir)){
print_help(opt_parser)
stop("At least one argument must be supplied (input dir).\n", call.=FALSE)
}
if (!dir.exists(opt$out_dir)) dir.create(opt$out_dir)
bams <- list.files(opt$input_dir, ".+.bam$", full.names = TRUE)
bam_names <- sub(".bam", "", basename(bams))
bamDataRanges <- getReadCountsFromBAM(bams, bam_names, parallel=as.integer(opt$threads)) # extract read counts from BAMs
resHaplo <- haplocn.mops(bamDataRanges, parallel=as.integer(opt$threads)) # identify CN regions
if (length(resHaplo@cnvs)==0) stop("No CNV regions detected, exiting.\n", call.=FALSE)
results_haplo <- calcIntegerCopyNumbers(resHaplo) # convert to integer CN
# plot outputs for each sample
iwalk(bam_names, .f = ~{
png(file.path(opt$out_dir, glue::glue("cn_mops_{.x}_haplo.png")), width=1800, height=1200)
segplot(results_haplo, sampleIdx=.y)
dev.off()
})
# Haploid Results
results_haplo_df <- as_tibble(cnvr(results_haplo)) %>%
# set_names(sub("^X", "", names(.))) %>%
rename(chrom=seqnames) %>%
mutate(name="", score=0, across(!1:5, .fns = ~as.integer(sub("CN", "", .x)))) %>%
select(chrom, start, end, name, score, strand, width, everything()) %>%
arrange(chrom) %>%
set_names(c("chrom", "start", "end", "name", "score", "strand", "width",bam_names))
# fix samples names (in case they started with a number)
# save to file
write_tsv(results_haplo_df, file="output/cn_mops_haplo_df.bed")Sarek (v2.7.2) is a Nextflow pipeline to call variants from WGS data using established tools and methods for data QC, alignment, variant calling, annotation and evaluation (such as BWA-MEM, Freebayes, GATK, Strelka2, and for CNV, Control-FREEC) (Garcia et al. 2020).
PLOIDY=1
DATE=`date +%d_%m_%Y`
RUN=Sarek_all_${DATE} # day of run was 02_02_2019
mkdir -p $WORK_DIR/${RUN}
cd !$
# JOB_NAME="BBmap_me14_vars"
GENOME_DIR="$HOME/data/reference_genomes/Ascochyta_reference_genomes/ArME14_v2"
NCORES=8CNVpytor (v2.7.2) is a Python pipeline to call CNVs from WGS data. (Suvakov et al. 2021)
CONDA_NAME=cnv
mamba create -n $CONDA_NAME cnvpytor jupyterlab samtools
WORK_DIR=$HOME/aDNA/jemma_cnv
DATE=`date +%d_%m_%Y`
RUN=cnvpytor_all_${DATE} # day of run was 02_02_2019
mkdir -p $WORK_DIR/${RUN}
REF_DIR=$HOME/data/reference_genomes/Ascochyta_reference_genomes/Updated_ME14_reference_and_annotations # on Griffith HPC
# REF_DIR=$HOME/data/reference_genomes # on Awoonga HPC
GENOME="Ascochyta_rabiei_ArME14.scaffolds.fa"
# JOB_NAME="BBmap_me14_vars"
BAM_DIR="$WORK_DIR/aligned_reads_ido"
mkdir -p $BAM_DIR
ln -s $HOME/scratch/A_rabiei_WGS_2021/All_2018_2020_30_09_2021/*.csorted.bam* $BAM_DIR/ # ido only!
rename -v 's/.csorted//' $BAM_DIR/* # remove the .csorted
NCORES=8
RUN_DIR=$PWD
# prepare the reference genome
cd $WORK_DIR
cp $REF_DIR/$GENOME $WORK_DIR/
conda activate $CONDA_NAME
bgzip -c $WORK_DIR/$GENOME > $WORK_DIR/$GENOME.bg.gz
samtools faidx $WORK_DIR/$GENOME.bg.gz
# make a custom reference genome
cnvpytor -root ArME14_gc.pytor -gc $WORK_DIR/${GENOME}.bg.gz -make_gc_file
cut -f1-2 $WORK_DIR/$GENOME.bg.gz.fai | gawk 'BEGIN{printf "#!/usr/bin/env python\n\nimport_reference_genomes = {\n\t\"me14\": {\n\t\t\"name\": \"ArME14v1.5\",\n\t\t\"species\": \"Ascochyta rabiei\",\n\t\t\"chromosomes\": OrderedDict(\n\t\t\t["}{printf "(\"%s\", (%s, \"S\")), ", $1, $2 }END{printf"]),\n\t\t\"gc_file\":\"ArME14_gc.pytor\"\n\t}\n}"}' | sed 's/), ]/)]/'> ArME14_ref_genome_conf.py
# process one sample only
cd $WORK_DIR/${RUN}
alias cnvpytor_ref='cnvpytor -conf $WORK_DIR/ArME14_ref_genome_conf.py'
cnvpytor_ref -root test.pytor -rd $BAM_DIR/AR0003.bam
cnvpytor_ref -root test.pytor -his 1000 10000 100000
cnvpytor_ref -root test.pytor -partition 1000 10000 100000
cnvpytor_ref -root test.pytor -call 1000 10000 100000 > calls.10000.tsv
# Add SNP data
# cnvpytor_ref -root test.pytor -snp $WORK_DIR/A_rabiei_2018_2020.bwa.fb.stringent.vcf.gz -sample AR0003
#
ls -1 $BAM_DIR/*.bam | egrep -v "discordants|splitters" | parallel --dry-run "cnvpytor -conf $WORK_DIR/ArME14_ref_genome_conf.py -root {/.}.pytor -rd {} ; cnvpytor -conf $WORK_DIR/ArME14_ref_genome_conf.py -root {/.}.pytor -his 1000 10000 100000 ; cnvpytor -conf $WORK_DIR/ArME14_ref_genome_conf.py -root {/.}.pytor -partition 1000 10000 100000 ; cnvpytor -conf $WORK_DIR/ArME14_ref_genome_conf.py -root {/.}.pytor -call 1000 10000 100000 > {/.}_calls.tsv" > cnvpytor.cmds
echo '#!/bin/bash
#PBS -V
cd $WORKDIR
>&2 echo Current working directory: $PWD
source ~/.bashrc'"
conda activate $CONDA_NAME
set -Eeo pipefail
gawk -v ARRAY_IND=\$PBS_ARRAY_INDEX 'NR==ARRAY_IND' \$CMDS_FILE | bash" > array.pbspro
CNVpytor=$(qsub -J1-$(cat cnvpytor.cmds | wc -l) -l select=1:ncpus=6:mem=8GB,walltime=5:00:00 -N cnvpytor -v "CMDS_FILE=$PWD/cnvpytor.cmds","WORKDIR=$PWD" ${PWD}/array.pbspro | egrep -o "^[0-9]+")Find failed jobs and rerun.
# find failed jobs and retry
seq 1 $(cat cnvpytor.cmds | wc -l) | parallel "printf {} ; qstat -fx $CNVpytor[{}] | grep "Exit_status"" | grep "Exit_status = 1" > failed.cmds
# rerun failed jobs
cut -f1 -d " " failed.cmds | parallel "gawk 'NR=={}' cnvpytor.cmds" > cnvpytor_rerun_failed.cmds
CNV_RERUN=$(qsub -J1-$(cat cnvpytor_rerun_failed.cmds | wc -l) -l select=1:ncpus=6:mem=8GB,walltime=5:00:00 -N cnvpytor -v "CMDS_FILE=$PWD/cnvpytor_rerun_failed.cmds","WORKDIR=$PWD" ${PWD}/array.pbspro | egrep -o "^[0-9]+")Find failed jobs and rerun (2).
# find failed jobs and retry
seq 1 $(cat cnvpytor_rerun_failed.cmds | wc -l) | parallel "printf {} ; qstat -fx $CNV_RERUN[{}] | grep "Exit_status"" | grep "Exit_status = 1" | cut -f1 -d " " | parallel "gawk 'NR=={}' cnvpytor_rerun_failed.cmds" > cnvpytor_rerun_failed2.cmds
CNV_RERUN2=$(qsub -J1-$(cat cnvpytor_rerun_failed2.cmds | wc -l) -l select=1:ncpus=6:mem=8GB,walltime=5:00:00 -N cnvpytor -v "CMDS_FILE=$PWD/cnvpytor_rerun_failed2.cmds","WORKDIR=$PWD" ${PWD}/array.pbspro | egrep -o "^[0-9]+")
seq 1 $(cat cnvpytor_rerun_failed2.cmds | wc -l) | parallel "printf {} ; qstat -fx $CNV_RERUN[{}] | grep "Exit_status"" | grep "Exit_status = 1"
# double check that all files were produced
wc -l cnvpytor.cmds
ls -1 *.pytor | egrep -cv "ArME14_gc.pytor"
ls -1 *calls.tsv | wc -lMerge calls from multiple samples - look at this link
# entire population
pytor_files=$(ls -1 *.pytor | grep -v "ArME14_gc.pytor" | gawk -vORS=" " '1')
# login to an interactive session on a compute node
qsub -l select=1:ncpus=6:mem=8GB,walltime=5:00:00 -N cnvpytor_merge -I -X
cnvpytor -conf $WORK_DIR/ArME14_ref_genome_conf.py -root *.pytor -view 1000
# then enter these lines in the cnvpytor prompt
set Q0_range 0 0.5
set size_range 1000 inf
set print_filename A_rabiei_cnvpytor_output.xlsx
print merged_calls
# 2020 population
pytor_files=$(ls -1 AR0*.pytor | gawk -vORS=" " '1')
# login to an interactive session on a compute node
qsub -l select=1:ncpus=6:mem=8GB,walltime=5:00:00 -N cnvpytor_merge -I -X
cnvpytor -conf $WORK_DIR/ArME14_ref_genome_conf.py -root *.pytor -view 1000
# then enter these lines in the cnvpytor prompt
set Q0_range 0 0.5
set size_range 1000 inf
set print_filename A_rabiei_2020_cnvpytor_output.xlsx
print merged_calls
# alternatively: create script with cnvpytor commands (not working)
echo "set Q0_range 0 0.5
set size_range 1000 inf
set print_filename A_rabiei_cnvpytor_output.xlsx
print merged_calls" > merge_calls.spytor
# submit job to the cluster
echo "cd $PWD; source ~/.bashrc'; conda activate $CONDA_NAME; < merge_calls.spytor" | qsub -l select=1:ncpus=6:mem=8GB,walltime=5:00:00 -N cnvpytor_mergeInteractive analysis with CNVpytor
The data produced by CNVPytor was then exported to CloudStor (folder jemma_cnv) and visualised using a Jupyter notebook on CloudStor SWAN. See explanation of output in Predicting CNV regions
As an alternative to SWAN, the data can be analysed interactively bu launching a Jupyter notebook directly from the Gowonda HPC (see documentation here)
WORK_DIR=$HOME/aDNA/jemma_cnv
cd $WORK_DIR
mamba install -n cnv jupyterlab
conda activate cnv
python -m ipykernel install --user --name cnv --display-name "cnvpytor"
# creating a PBS script to run Jupyter
echo '#!/bin/bash
#PBS -m b
#PBS -M jemma.tennant@griffithuni.edu.au
#PBS -N cnv_jupyter
### Add current shell environment to job (comment out if not needed)
# The jobs working directory
cd $PBS_O_WORKDIR
source ~/.bashrc
source ~/.proxy
# source /usr/local/bin/s3proxy.sh
unset PYTHONPATH
conda activate cnv
jupyter lab --no-browser --port=5678' > cnvpytor_jupyter.pbs
qsub -l select=1:ncpus=6:mem=8GB,walltime=600:00:00 cnvpytor_jupyter.pbsThe core SNPs were initially filtered by TASSEL (Bradbury et al. 2007) and then imported to R and were used to analyse the population structure using a range of popgen packages (adegenet v2.1.7 , poppr v2.9.3, mmod v1.3.3) (Jombart and Ahmed 2011; Winter 2012; Kamvar et al. 2014; Jombart et al. 2020)
The SNP data was used to calculate a distance matrix between all samples, which was then used as input for principle component analysis (PCA). The PCA was plotted as an interactive plotly plot (v4.10.0) to allow exploration of the data and selection of isolates for further experiments/screening based on their phenotype, collection origin and genetic makeup.
Variants were then further filtered to keep only polymorphic and bi-allelic SNPs using the VariantAnnotation v1.42.1 and GenomicRanges v1.48.0 Bioconductor R packages (Lawrence et al. 2013; Lawrence and Morgan 2014; Obenchain et al. 2014). The SNPs were annotated and position of each variant relative to gene locations was determined (coding/intron/promotor/intergenic), based on the reference genome and gene models of A. rabiei strain Me14 (where the contig names were shortened to remove the Arab_Me14 prefix).
cat A.rabiei_me14.fasta | sed 's/Arab_Me14_//' > A_rabiei_me14_short_names.fasta
zcat Arab_me14.gff3.gz | sed 's/Arab_Me14_//' > Arab_me14_short_names.gffVariants in coding regions of genes were identified as Synonymous/Non-synonymous/Non-sense mutations if they were silent, changing the amino acid or the reading frame, respectively. Additional annotation of the genes at each SNP site was performed by a BLASTp search against the NCBI non-redundant protein database (nr) and scanning against the InterPro conglomerate dbs.
# Download the python wrapper
$HPC_GRID_RUNNER_DIR/BioIfx/hpc_FASTA_GridRunner.pl --cmd_template "cd $PBS_O_WORKDIR; pyenv shell miniconda3-latest; ~/etc/tools/Annotation/InterPro/iprscan5_urllib3.py --sequence=__QUERY_FILE__ --outfile=__QUERY_FILE__.interpro.tsv --outformat=tsv --goterms --pathways --email=i.bar@griffith.edu.au " --query_fasta Assoc_SNP_genes_cds_05_02_2018.fasta -G $HPC_GRID_RUNNER_DIR/hpc_conf/small_PBS_jobs.conf -N 1 -O SNP_interproEffector prediction of the variant-associated genes was performed by Predector v1.2.6 (Jones et al. 2021), following the installation and usage instructions in the online Wiki documentation (downloading the external dependencies and building a conda environment).
# building a conda environment with the external dependencies
ENVIRONMENT=conda
curl -s "https://raw.githubusercontent.com/ccdmb/predector/1.2.6/install.sh" \
| bash -s "${ENVIRONMENT}" \
-3 signalp-3.0.Linux.tar.Z \
-4 signalp-4.1g.Linux.tar.gz \
-5 signalp-5.0b.Linux.tar.gz \
-6 signalp-6.0g.fast.tar.gz \
-t targetp-2.0.Linux.tar.gz \
-d deeploc-1.0.All.tar.gz \
-m tmhmm-2.0c.Linux.tar.gz \
-p phobius101_linux.tar.gz
CONDA_PATH=$(conda env list | gawk '$0~/predector/{print $2}')
mkdir -p ~/scratch/sandbox/predector_ME14
cd !$
conda activate nf
# nextflow run -profile gowonda -c $HOME/.nextflow/GU.config -with-conda $CONDA_PATH -r patch-2 \
# IdoBar/predector --proteome Ascochyta_rabiei_ME14_cds-proteins.faa
nextflow run -profile gowonda -c $HOME/.nextflow/GU.config -with-conda $CONDA_PATH -r 1.2.6 \
ccdm/predector --proteome Ascochyta_rabiei_ME14_cds-proteins.faa Look for particular effector genes and genes associated with plant-pathogen interactions and pathogenicity genes. Look in the literature and create a list of target genes.
The variants were summarised and visualised across the genome scaffolds and visualised using circlize v0.4.15 R package (Gu et al. 2014).
Association between variants and pathogenicity levels was identified with GAPIT v3.2 R package (Wang and Zhang 2021), using a XXXX model and a significance threshold of p-value \(\le0.05\).
This document was last updated at 2022-08-29 16:11:23 using R Markdown (built with R version 4.2.0 (2022-04-22 ucrt)). Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is especially powerful at authoring documents and reports which include code and can execute code and use the results in the output. For more details on using R Markdown see http://rmarkdown.rstudio.com and Rmarkdown cheatsheet.